EricP wrote:
> MitchAlsup wrote:
>>
> That result feeds to the Log2 Parser which selects up to 6 instructions
> from those source bytes.
> Fetch Line Buf 0 (fully assoc index)
> Fetch Line Buf 1
> Fetch Line Buf 2
> Fetch Line Buf 3
> v v
> 32B Blk1 32B Blk0
> v v
> Alignment Shifter 8:1 muxes
> v
> Log2 Parser and Branch Detect
> v v v v v v
> I5 I4 I3 I2 I1 I0
Been thinking of this in the background the last month. It seems to me that
a small fetch-predictor is in order.
This fetch-predictor makes use of the natural organization of the ICache as
a matrix of SRAM macros (of some given size:: say 2KB), each SRAM macro having
a ¼-line access width. Let us call this the horizontal direction. In the
vertical direction we have sets (or ways if you prefer).
Each SRAM macro (2KB) is 128 bits wide by 128 words deep, so we need a 7-bit
word index. Each SRAM column has {2,4,8,...} SRAM macros, so we need
{1,2,3,...}-bits of set-index {8 macros per column = a 64KB ICache with a
3-bit set-index}.
Putting 4 of these index pairs together gives us a (7+3)×4 = 40-bit fetch-
predictor entry, plus a few bits for state and control. {{We may need to
add a field used to access the fetch-predictor for the next cycle}}.
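To make the entry layout concrete, here is a rough C sketch of packing and
unpacking such a 40-bit entry. The field names, the bit ordering, and the
16-bit width chosen for the 'next' field are illustrative assumptions, not
part of the proposal:

```c
#include <stdint.h>

/* One fetch-predictor entry per the text: four {7-bit word index,
 * 3-bit set index} pairs = 40 bits, plus an assumed 16-bit 'next'
 * field indexing the predictor itself for the following cycle. */
static uint64_t fp_pack(const uint8_t row[4], const uint8_t set[4],
                        uint16_t next)
{
    uint64_t e = 0;
    for (int q = 0; q < 4; q++) {
        e |= (uint64_t)(row[q] & 0x7F) << (q * 10);      /* 7-bit word index */
        e |= (uint64_t)(set[q] & 0x07) << (q * 10 + 7);  /* 3-bit set index  */
    }
    return e | ((uint64_t)next << 40);                   /* next prediction  */
}

static uint8_t fp_row(uint64_t e, int q) { return (e >> (q * 10)) & 0x7F; }
static uint8_t fp_set(uint64_t e, int q) { return (e >> (q * 10 + 7)) & 0x07; }
```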
We are now in a position to access 4×¼ = 1 cache line (16 words) from the
matrix of SRAM macros.
Sequential access:
It is easy to see that one can access 16 words (16 potential instructions)
in a linear sequence even when the access crosses a cache line boundary.
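A minimal C sketch of that crossing case, assuming 64B lines split into four
16B quarter-lines (one per SRAM column) and the 7-bit row index from above;
all names here are illustrative:

```c
#include <stdint.h>

#define QUARTERS 4
#define ROW_MASK 0x7F          /* 7-bit row index into a 128-word macro */

typedef struct {
    uint8_t row[QUARTERS];     /* row index for each column's macro */
    uint8_t set[QUARTERS];     /* 3-bit set select per column       */
} fetch_pred_t;

/* Sequential case: 16 words starting at 'addr' may straddle a line
 * boundary. Columns whose quarter index is below the starting quarter
 * belong to the next cache line, so they take the next row. */
static fetch_pred_t sequential_entry(uint32_t addr,
                                     const uint8_t sets[QUARTERS])
{
    fetch_pred_t e;
    uint32_t line = addr >> 6;        /* 64B line number  */
    uint32_t q0   = (addr >> 4) & 3;  /* starting quarter */
    for (uint32_t q = 0; q < QUARTERS; q++) {
        uint32_t l = line + (q < q0 ? 1 : 0);
        e.row[q] = l & ROW_MASK;
        e.set[q] = sets[q];
    }
    return e;
}
```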
Non-sequential access:
Given a 6-wide machine (and known instruction statistics wrt VLE utilization)
and the assumption of 1 taken branch per issue-width:: the fetch-predictor
accesses 4 SRAM macros, indexing each macro with its 7-bit index and choosing
the set with its 3-bit index. {We are accessing a set-associative cache as if
it were direct-mapped.}
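Sketched in C, the point is that the predicted set selects exactly one macro
per column up front, with no tag compare in the read path (the geometry and
names below are assumptions for illustration):

```c
#include <stdint.h>

/* One ICache column: 8 sets x 128 rows of 16B quarter-lines
 * (the assumed 64KB organization, 4 such columns total). */
typedef struct {
    uint8_t data[8][128][16];
} icache_column_t;

/* Read as if direct-mapped: the predictor's set bits pick the macro,
 * so the column does a single SRAM read. The tag check happens later
 * in the pipeline, off the instruction-delivery path. */
static const uint8_t *column_read(const icache_column_t *col,
                                  uint8_t set, uint8_t row)
{
    return col->data[set & 7][row & 127];  /* no tag compare here */
}
```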
Doubly non-sequential access:
There are many occurrences of a number of instructions on the sequential
path, then a conditional branch to a short run of instructions on the
alternate path ending with a direct branch to somewhere else. We use the
next fetch-predictor access field so that this direct branch does not
incur an additional cycle of fetch (or execute) latency. This direct branch
can be a {branch, call, or return}.
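A toy C sketch of that chaining, with an assumed 256-entry predictor table:
each entry's 'next' field selects the entry used the following cycle, so the
fetch through the ending direct branch is already predicted:

```c
#include <stdint.h>

#define FP_ENTRIES 256   /* assumed predictor capacity */

typedef struct {
    uint8_t row[4], set[4];  /* the 4 ICache accesses        */
    uint8_t next;            /* index of next predictor entry */
} fp_slot_t;

static fp_slot_t fp_table[FP_ENTRIES];

/* Follow the prediction chain for n cycles starting at 'start':
 * each cycle the predictor indexes itself for the next cycle, so a
 * short alternate path ending in a direct branch costs no extra
 * fetch bubble. */
static uint8_t fp_follow(uint8_t start, int n)
{
    uint8_t i = start;
    while (n-- > 0)
        i = fp_table[i].next;
    return i;
}
```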
Ramifications:
When instructions are written into the ICache, they are positioned in a set
which allows the fetch-predictor to access the sequential path of instructions
and the alternate path of instructions.
All instructions are always fetched from the ICache, which is kept coherent
by external SNOOP activity, so there is minimal excess state and no surgery
at context switching or the like.
ICache placement ends up dependent on the instructions being written in accord
with how control flow arrived at this point (satisfying the access method
above).
This organization satisfies several "hard" cases::
a) 3 ST instructions, each 5 words in size: the ICache access supplies 16
words; all 4×¼ accesses are sequential but may span cache-line boundaries and
set placements. Such sequences are found in subroutine prologues setting up
local variables with static assignments on the stack. The proposed machine can
only perform 3 memory references per cycle, so this seems a reasonable balance.
b) One can process sequential instructions up to a call and several instructions
at the call-target in the same issue cycle. The same can transpire on return.
c) Should a return find a subsequent call (after a few instructions), both the
EXIT instruction and the ENTER instruction can be cut short because all
the preserved registers are already where they need to be on the call/return
stack; taking fewer cycles wandering around the call/return tree.
So:: the fetch-predictor contains 5 accesses, 4 to the ICache for instructions
and 1 to itself for the next fetch-prediction.
{ set[0] column[0] set[1] column[1] set[2] column[2] set[3] column[3] next}
| +-------+ | +-------+ | +-------+ | +-------+ | +-------+
| | | +-->| | | | | | | | +->| |
| +-------+ +-------+ | +-------+ | +-------+ +-------+
+--> | | | | | | | | | |
+-------+ +-------+ | +-------+ | +-------+
| | | | +--> | | +--> | |
+-------+ +-------+ +-------+ +-------+
| | | | | | | |
+-------+ +-------+ +-------+ +-------+
| | | |
V V V V
inst[0] inst[1] inst[2] inst[3]
The instruction groups still have to be "routed" into some semblance of order,
but this can take place over the 2 or 3 decode cycles.
All of the ICache tag checking is performed "later" in the pipeline, taking
tag-check and selection multiplexing out of the instruction delivery path.